Abstract — Background subtraction has been a driving engine
for many computer vision and video analytics tasks. Although 
its many variants exist, they all share the underlying assumption 
that photometric scene properties are either static or exhibit 
temporal stationarity. While this works in some applications, the 
model fails when one is interested in discovering changes in scene 
dynamics rather than those in a static background; detection of 
unusual pedestrian and motor traffic patterns is but one example. 
We propose a new model and computational framework that 
address this failure by considering stationary scene dynamics 
as a "background" with which observed scene dynamics are 
compared. Central to our approach is the concept of an event,
which we define as short-term scene dynamics captured over a time
window at a specific spatial location in the camera field of view.
We compute events by time-aggregating motion labels, obtained
by background subtraction, as well as object descriptors (e.g., object
size). Subsequently, we characterize events probabilistically,
but use low-memory, low-complexity surrogates in practical
implementation. Using these surrogates amounts to behavior
subtraction, a new algorithm with some surprising properties. 
As demonstrated here, behavior subtraction is an effective tool 
in anomaly detection and localization. It is resilient to spurious 
background motion, such as one due to camera jitter, and is 
content-blind, i.e., it works equally well on humans, cars, animals, 
and other objects in both uncluttered and highly-cluttered scenes. 
Clearly, treating video as a collection of events rather than 
colored pixels opens new possibilities for video analytics. 

Index Terms — Video analysis, activity analysis, anomaly de- 
tection, behavior modeling, video surveillance. 



I. Introduction 

MANY computer vision and video analytics algorithms 
rely on background subtraction as the engine of choice 
for detecting areas of interest (change). Although a number 
of models have been developed for background subtraction, 
from single Gaussian [?] and mixture of Gaussians [?] to non-parametric
kernel methods, they all share the underlying
assumption that photometric scene properties (e.g., luminance, 
color) are either static or exhibit temporal stationarity. The 
static background assumption works quite well for some appli- 
cations, e.g., indoor scenes under constant illumination, while 
the temporally-stationary background assumption is needed in
other cases, such as outdoor scenes with natural phenomena 
(e.g., fluttering leaves). However, both models fail if one is 
interested in discovering changes in scene dynamics rather 
than those taking place in a static background. Examples of 
such scenarios are: detection of unusual motor traffic patterns
(e.g., too fast or too slow), detection of a moving group
of individuals where a single walking person is expected, and
detection of a moving object against shimmering or turbulent
water surface (background motion). Although each of these
challenges can be addressed by a custom-built method, e.g.,
explicitly estimating object trajectories or discovering the
number of moving objects, there is no approach to date that
can address all such scenarios in a single framework.

In order to address this challenge, instead of searching for 
photometric deviations in time, one should look for dynamic 
deviations in time. To date, the problem has been attacked 
primarily by analyzing two-dimensional motion paths resulting 
from tracking objects or people [?], [?], [?], [?], [?]. Usually, 
reference motion paths are computed from a training video 
sequence first. Then, the same tracking algorithm is applied 
to an observed video sequence, and the resulting paths are 
compared with the reference motion paths. Unfortunately, 
such methods require many computing stages, from low-level 
detection to high-level inferencing [?], and often result in 
failure due to multiple, sequential steps. 

In this paper, we propose a new model and computational 
framework that extend background subtraction to, what we 
call, behavior subtraction [?], while at the same time address- 
ing deficiencies of motion-path-based algorithms. Whereas in 
background subtraction static or stationary photometric prop- 
erties (e.g., luminance or color) are assumed as the background 
image, we propose to use stationary scene dynamics as a 
"background" activity with which observed scene dynamics 
are compared. The approach we propose requires neither 
computation of motion nor object tracking, and, as such, is 
less prone to failure. Central to our approach is the concept 
of an event, that we define as short-term scene dynamics 
captured over a time window at a specific spatial location 
in the camera field of view. We compute events by time- 
aggregating motion labels and/or suitable object descriptors 
(e.g., size). Subsequently, we characterize events probabilisti- 
cally as random variables that are independent and identically 
distributed (iid) in time. Since the estimation of a probability
density function (PDF) at each location is both memory- 
and CPU-intensive, in practical implementation we resort 
to a low-memory, low-complexity surrogate. Using such a 
surrogate amounts to behavior subtraction, a new algorithm 
with some surprising properties. As we demonstrate experi- 
mentally, behavior subtraction is an effective tool in anomaly 
detection, including localization, but can also serve as a motion
detector that is very resilient to spurious background motion, e.g.,
resulting from camera jitter. Furthermore, it is content-blind, 
i.e., applicable to humans, cars, animals, and other objects in 
both uncluttered and highly-cluttered scenes. 

This paper is organized as follows. In Section II we review
previous work. In Section III we recall background subtraction
and introduce notation. In Section IV we introduce behavior
space and the notion of an event, while in Section V we
describe the behavior subtraction framework. In Section VI
we discuss our experimental results and in Section VII we
draw conclusions.

II. Previous work 

There are two fundamental approaches to anomaly detec- 
tion. One approach is to explicitly model all anomalies of 
interest, thus constructing a dictionary of anomalies, and for 
each observed video to check if a match in the dictionary 
can be found. This is a typical case of classification, and 
requires that all anomaly types be known a priori. Although 
feasible in very constrained scenarios, such as detecting people 
carrying boxes/suitcases/handbags [?], detecting abandoned
objects [?] or identifying specific crowd behavior anomalies
[?], in general this approach is impractical because it cannot
deal with unknown anomalies.

An alternative approach is to model normality and then 
detect deviations from it. In this case, no dictionary of anoma- 
lies is needed but defining and modeling what constitutes 
normality is a very difficult task. One way of dealing with this 
difficulty is by applying machine learning that automatically 
models normal activity based on some training video. Then, 
any monitored activity different from the normal pattern is labeled
as an anomaly. A number of methods have been developed
that apply learning to two-dimensional motion paths resulting 
from tracking of objects or people [?]. Typically, the approach 
is implemented in two steps. In the first step, a large number 
of "normal" individuals or objects are tracked over time. 
The resulting paths are then summarized by a set of motion 
trajectories, often translated into a symbolic representation of 
the background activity. In the second step, new paths are 
extracted from the monitored video and compared to those 
computed in the training phase. 

Whether one models anomaly or normality, the background 
activity must be somehow captured. One common approach is 
through graphical state-based representations, such as hidden 
Markov models or Bayesian networks [?], [?], [?], [?], [?]. 
To the best of our knowledge, Johnson and Hogg [?] were
the first to consider human trajectories in this context. The
method begins by vector-quantizing tracks and clustering the
result into a predetermined number of PDFs using a neural
network. Based on the training data, the method predicts
the trajectory of a pedestrian and decides if it is anomalous or not.
This approach was subsequently improved by simplifying the
training step [?] and embedding it into a hierarchical structure
based on co-occurrence statistics [?]. More recently, Saleemi
et al. [?] proposed a stochastic, non-parametric method for
modeling scene tracks. The authors claim that the use of
predicted trajectories and a tracking method robust to occlusions
jointly permit the analysis of more general scenes, unlike other
methods that are limited to roads and walkways.

Although there are advantages to using paths as motion 
features, there are clear disadvantages as well. First, tracking
is a difficult task, especially in real time. Since the anomaly
detection is directly related to the quality of tracking, a
tracking error will inevitably bias the detection step. Secondly,
since each individual or object monitored is related to a single
path, it is hard to deal with people occluding each other. For
this reason, path-based methods are not well suited to highly-
cluttered environments.

Recently, a number of anomaly detection methods have 
been proposed that do not use tracking. These methods work 
at pixel level and use either motion vectors [?], [?], [?] or 
motion labels [?], [?], [?] to describe activity in the scene. 
They all store motion features in an image-like 2D structure 
(be it probabilistic or not) thus easing memory and CPU 
requirements. For example, Xiang et al. [?] represent moving 
objects by their position, size, temporal gradient and the so- 
called "pixel history change" (PHC) image that accumulates 
temporal intensity differences. During the training phase, an 
EM-based algorithm is used to cluster the moving blobs, 
while at run-time each moving object is compared to the 
pre-calculated clusters. The outlying objects are labeled as 
anomalous. Although the concept of PHC image is somewhat 
similar to the behavior image proposed here, Xiang et al. 
do not use it for anomaly detection but for identification of 
regions of interest to be further processed. 

A somewhat different approach using spatio-temporal in- 
tensity correlation has been proposed by Shechtman and Irani 
[?]. Here, an observed sequence is built from spatio-temporal 
segments extracted from a training sequence. In this analysis-by-synthesis
method, only regions that can be built from large
contiguous chunks of the training data are considered normal. 

Our approach falls into the category of methods that model
normality and look for outliers; however, it is not based on
motion paths but on simple pixel attributes instead. Thus, it
avoids the pitfalls of tracking while affording explicit modeling
of normality at low memory and CPU requirements. Our contributions
are as follows. We introduce the concept of an event,
or short-term scene dynamics captured over a time window at a
specific spatial location in the camera field of view. With each
event we associate features, such as size, direction, speed, busy
time, color, etc., and propose a probabilistic model based on
a time-stationary random process. Finally, we develop a simple
implementation of this model by using surrogate quantities
that allow low-memory and low-CPU implementation.

III. Background Subtraction: Anomaly Detection
in Photometric Space

We assume in this paper that the monitored video is cap- 
tured by a fixed camera (no PTZ functionality) that at most 
undergoes jitter, e.g., due to wind load or other external factors. 

Let $I$ denote a color video sequence with $I_t(x)$ denoting
color attributes (e.g., R, G, B) at specific spatial location $x$
and time $t$. We assume that $I_t(x)$ is spatially sampled on a 2-D
lattice $\Lambda$, i.e., $x \in \Lambda \subset \mathbb{R}^2$ is a pixel location. We also assume
that it is sampled temporally, i.e., $t = k\Delta t$, $k \in \mathbb{Z}$, where $\Delta t$
is the temporal sampling period dependent on the frame rate at
which the camera operates. For simplicity, we assume $\Delta t = 1$
in this paper, i.e., normalized time. We denote by $I_t$ a frame,
i.e., a restriction of video $I$ to specific time $t$.

In traditional video analysis, color and luminance are pivotal
quantities in the processing chain. For example, in back- 
ground subtraction, the driving engine of many video analysis 
tasks, the color of the background is assumed either static 
or stationary. Although simple frame subtraction followed 
by thresholding may sometimes suffice in the static case, 
unfortunately it often fails due to acquisition noise or illumi- 
nation changes. If the background includes spurious motion, 
such as environmental effects (e.g., rain, snow), fluttering tree 
leaves, or shimmering water, then determining outliers based 
on frame differences is insufficient. A significant improvement 
is obtained by determining outliers based on PDF estimates of 
features such as color. Assume that $P_{RGB}$ is a joint PDF of the
three color components estimated using a 3-D variant of the
mixture-of-Gaussians model [?] or the non-parametric model
[?] applied to a training video sequence. $P_{RGB}$ can be used
to test if a color at a specific pixel and time in the monitored
video is sufficiently probable, i.e., if $P_{RGB}(I_t(x)) > \tau$, where
$\tau$ is a scalar threshold, then $I_t(x)$ is likely to be part of the
modeled background, otherwise it is deemed moving.

Although the thresholding of a PDF is more effective than
the thresholding of frame differences, it is still executed in the
space of photometric quantities (color, luminance, etc.), and
is thus unable to directly account for scene dynamics. However,
modeling of background dynamics (activities) in the photometric
space is very challenging. We propose an alternative
that is both conceptually simple and computationally efficient.
First, we remove the photometric component by applying
background subtraction and learn the underlying stationary
statistical characterization of scene dynamics based on a
two-state (moving/static) renewal model. Then, we reliably infer
novelty as a departure from normality.

IV. Behavior Space: From Frames to Events 

As color and luminance contain little direct information on 
scene dynamics, we depart from this common representation 
and adopt motion label as our atomic unit. Let $L_t(x)$ be a
binary random variable embodying the presence of motion
($L = 1$) or its absence ($L = 0$) at position $x$ and time $t$. Let
$l_t(x)$ be a specific realization of $L_t(x)$ that can be computed
by any of the methods discussed in Section III or by more
advanced methods accounting for spatial label correlation [?],
[?], [?].

While some of these methods are robust to noise and back- 
ground activity, such as rain/snow or fluttering leaves, they of- 
ten require a large amount of memory and are computationally 
intensive. Since simplicity and computational efficiency are 
key concerns in our approach, we detect motion by means of 
a very simple background subtraction method instead, namely 

$$l_t(x) = |I_t(x) - b_t(x)| > \tau, \qquad (1)$$

where $\tau$ is a fixed threshold and $b_t$ is the background image
computed as follows

$$b_{t+1}(x) = (1 - \rho)\, b_t(x) + \rho\, I_t(x) \qquad (2)$$

with $\rho$ in the range 0.001-0.01. This linear background update
makes it possible to account for long-term changes. Although this method
is sensitive to noise and background activity, it is trivial to
implement, requires very little memory and processing power,
and depends on one parameter only. Clearly, replacing this
method with any of the advanced techniques will only improve
the performance of our approach.

Fig. 1. (a) Video frame $I_{t=t_0}$ captured by a vibrating camera and the
corresponding motion label field $l_{t=t_0}$. (b) Binary waveforms showing
the time evolution of motion labels $l$ at two locations (marked C
and D in (a)). (c) Behavior signatures at the same locations computed
using the object-size descriptor (3). The pixel located near an intensity
edge (D) is "busy", due to camera vibrations, compared to the pixel
located in a uniform-intensity area (C). The large bursts of activity
in behavior signatures correspond to pedestrians.
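
As an illustration of (1) and (2), the following Python sketch labels motion in a short sequence of frames. It is only a sketch under stated assumptions: frames are grayscale intensities stored as nested lists (a color frame would need a norm over channels), the first frame seeds the background, and the default values of `tau` and `rho` as well as all function names are hypothetical choices made for this example.

```python
def subtract_background(frame, background, tau):
    """Eq. (1): label a pixel as moving (1) when |I_t(x) - b_t(x)| > tau."""
    return [[1 if abs(i - b) > tau else 0 for i, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def update_background(frame, background, rho):
    """Eq. (2): b_{t+1}(x) = (1 - rho) * b_t(x) + rho * I_t(x)."""
    return [[(1.0 - rho) * b + rho * i for i, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def motion_labels(frames, tau=20.0, rho=0.005):
    """Yield one binary label field per frame.

    Assumption: the first frame is used as the initial background b_t.
    """
    background = [[float(v) for v in row] for row in frames[0]]
    for frame in frames:
        yield subtract_background(frame, background, tau)
        background = update_background(frame, background, rho)
```

Because the update in (2) is a first-order recursive filter, only one background image has to be kept in memory, which matches the low-memory motivation given above.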

Fig. 1 shows an example realization of motion label field $L_t$
computed by the above method as well as a binary waveform
showing temporal evolution of motion label at specific location
$x$ (Fig. 1b). Each such waveform captures the amount of
activity occurring at a given spatial location during a certain
period of time and thus can be considered as a simple behavior
signature. For instance, patterns associated with random
activity (fluttering leaves), periodic activity (highway traffic),
bursty activity (sudden vehicle movement after onset of green
light), or no activity, all have a specific behavior signature.
Behavior signatures other than a simple on/off motion label
are possible.

a) Object descriptor: A moving object leaves a behavior 
signature that depends on its features such as size, shape, 
speed, direction of movement, etc. For example, a large
moving object will leave a wider impulse than a small object
(Fig. 1b), but this impulse will get narrower as the object
accelerates. One can combine several features in a descriptor 
in order to make the behavior signature more unique. In fact, 
one can even add color/luminance to this descriptor in order 
to account for photometric properties as well. Thus, one can 
think of events as spatio-temporal units that describe what type 
of activity occurs and also what the moving object looks like. 

Let a random variable $F$ embody the object descriptor¹, with
$f$ being its realization. In this paper, we concentrate on an object
descriptor based on the moving object's size for two reasons. First,
we found that despite its simplicity it performs well on a wide 
range of video material (motor traffic, pedestrians, objects on 
water, etc.); it seems the moving object size is a sufficiently 
discriminative characteristic. Secondly, the size descriptor can 
be efficiently approximated as follows: 



$$f_t(x) = \frac{1}{N \times N} \sum_{y \in \mathcal{N}(x);\; y \bowtie x} \delta(l_t(x), l_t(y)), \qquad (3)$$

where $\mathcal{N}(x)$ is an $N \times N$ window centered at $x$ and $y \bowtie x$
means that $y$ and $x$ are connected (are within the same
connected component). $\delta(\cdot) = 1$ if and only if $l_t(x) = l_t(y) = 1$,
i.e., if both $x$ and $y$ are deemed moving, otherwise $\delta(\cdot) = 0$.
Note that $f_t(x) = 0$ whenever $l_t(x) = 0$. This descriptor is
zero for a pixel away from the object, increases non-linearly
as the pixel moves closer to the object and saturates at 1.0 for
pixels inside a large object fully covering the window $\mathcal{N}$.
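
The size descriptor in (3) can be sketched in Python as a flood fill restricted to the window. Several details here are assumptions, since the text does not fix them: connectivity is taken to be 4-connectivity, it is evaluated only through moving pixels inside the window, $N$ is assumed odd, and windows that extend past the image border are simply truncated while still being normalized by $N \times N$. The function name `size_descriptor` is hypothetical.

```python
from collections import deque


def size_descriptor(labels, x, n):
    """Approximate f_t(x) of Eq. (3) at pixel x = (row, col).

    Counts the moving pixels of the n-by-n window centered at x that are
    4-connected to x through moving pixels of that window, divided by n*n.
    """
    r0, c0 = x
    if labels[r0][c0] == 0:
        return 0.0  # f_t(x) = 0 whenever l_t(x) = 0
    half = n // 2
    rows, cols = len(labels), len(labels[0])

    def in_window(r, c):
        return (0 <= r < rows and 0 <= c < cols
                and abs(r - r0) <= half and abs(c - c0) <= half)

    seen = {(r0, c0)}
    queue = deque([(r0, c0)])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if in_window(nr, nc) and labels[nr][nc] == 1 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen) / float(n * n)
```

For a window fully covered by a moving object the function returns 1.0, and for an isolated moving pixel it returns $1/N^2$, in line with the saturation behavior described above.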

Fig. 1c shows an example of behavior signature based on
the size descriptor. Clearly, $f_t(x) = 0$ means inactivity while
$f_t(x) > 0$ means activity caused by a moving object; the larger
the object, the larger the $f_t(x)$ until it saturates at 1. The video
frame shown has been captured by a vibrating camera,
hence the noisy behavior signature for pixel "D" that is close to
an intensity edge.

b) Event model: An event needs to be associated with 
a time scale. For example, a short time scale is required to 
capture an illegal U-turn of a car, whereas a long time scale 
is required to capture a traffic jam. We define an event $E_t(x)$
for pixel at $x$ as the behavior signature (object size, speed,
direction as the function of time $t$) left by moving objects
over a $w$-frame time window, and model it by a Markov model
shown in Fig. 2.

For now, consider only the presence/absence of activity ($L$)
as the object descriptor. Assuming $\pi$ to be the initial busy-state
probability ($L = 1$), the probability of sequence $\{L_i = l_i\}_W =
(l_{t-w+1}(x), l_{t-w+2}(x), \ldots, l_t(x))$, at location $x$ and
within the time window $W = [t - w + 1, t]$, can be written as
follows:

$$\begin{aligned}
P_x(\{L_i = l_i\}_W) &= \pi\, q^{\beta_1} (1-q)\, p^{\iota_1} (1-p)\, q^{\beta_2} (1-q)\, p^{\iota_2} \cdots \\
&= \pi\, q^{\sum_k \beta_k}\, p^{\,w - \sum_k \beta_k}\, (1-q)^m (1-p)^n \qquad (4)
\end{aligned}$$



where the binary sequence of 0's and 1's is implicitly expressed
through the busy intervals $\beta_k$ and idle intervals $\iota_k$ (Fig. 2). Note that $m$, $n$
are the numbers of transitions "moving $\to$ static" and "static
$\to$ moving", respectively. The last line in (4) stems from the
fact that the sum of busy and idle intervals equals the length of
time window $W$. This expression can be simplified by taking
the negative logarithm:

$$\begin{aligned}
-\log P_x(\{L_i = l_i\}_W) &= -\log \pi - \left(\log \frac{q}{p}\right) \sum_k \beta_k - w \log p - m \log(1 - q) - n \log(1 - p) \\
&= A_0 + A_1 \sum_{k=t-w+1}^{t} l_k(x) + A_2\, \kappa_t(x), \qquad (5)
\end{aligned}$$

where $A_0$, $A_1$, $A_2$ are constants, the second term measures the
total busy time using motion labels and $\kappa_t(x)$ is proportional
to the total number of transitions in time window $W$ at $x$.

Fig. 2. Markov chain model for dynamic event $E$: $p$, $q$ are state probabilities
(static and moving, respectively), and $1 - p$, $1 - q$ are transition probabilities.
$\beta_1, \iota_1, \beta_2, \iota_2$ denote consecutive busy and idle intervals. With each busy
interval is associated an object descriptor $F$, such as its size, speed/direction
of motion, color, luminance, etc.

¹$F$ is a random vector if the descriptor includes multiple features.
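
As a numerical check of the step from (4) to (5), the sketch below evaluates the negative log-probability of a binary label window in two ways: from the closed form in (5) using only the busy time and the transition counts, and factor by factor from the product in (4). It follows the factor convention of (4) as written (one factor $q$ or $p$ per frame, one factor $1-q$ or $1-p$ per transition, and the prefactor $\pi$); the function names and the parameter values in the usage lines are hypothetical.

```python
import math


def window_counts(labels):
    """Return (total busy time, m, n) for a binary label window.

    m counts "moving -> static" transitions, n counts "static -> moving".
    """
    busy = sum(labels)
    m = sum(1 for a, b in zip(labels, labels[1:]) if a == 1 and b == 0)
    n = sum(1 for a, b in zip(labels, labels[1:]) if a == 0 and b == 1)
    return busy, m, n


def neg_log_prob(labels, pi, p, q):
    """Closed form of Eq. (5)."""
    w = len(labels)
    busy, m, n = window_counts(labels)
    return (-math.log(pi) - math.log(q / p) * busy - w * math.log(p)
            - m * math.log(1.0 - q) - n * math.log(1.0 - p))


def neg_log_prob_product(labels, pi, p, q):
    """Same quantity accumulated from the product form of Eq. (4)."""
    prob = pi
    for k, label in enumerate(labels):
        prob *= q if label == 1 else p
        if k + 1 < len(labels) and labels[k + 1] != label:
            prob *= (1.0 - q) if label == 1 else (1.0 - p)
    return -math.log(prob)


window = [1, 1, 0, 0, 0, 1, 1, 1]
closed = neg_log_prob(window, pi=0.5, p=0.9, q=0.8)
direct = neg_log_prob_product(window, pi=0.5, p=0.9, q=0.8)
```

The two values agree up to floating-point error, which shows that the window probability depends on the labels only through the busy time and the number of transitions.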

Thus far we have assumed that the moving object was
described only by motion labels $L_t(x)$. Suppose now that
a descriptor $F_t(x)$, such as the size, is also associated with the
moving object at location $x$ and time $t$ within a busy period
in time window $W$, i.e., $t \in \beta_k \subset W$. The random variable
(vector) $F_t(x)$ is described by a conditional distribution
dependent on the state of the Markov process, as illustrated
in Fig. 2. We assume that $F_t(x)$ is conditionally independent
of other random variables $F_{t_0}(x)$, $t_0 \neq t$, when conditioned
on the underlying state of the Markov process, and that its
distribution has exponential form when busy and point mass
when idle:

$$P_x(F_t = f_t \mid L_t = l_t) = \begin{cases} \frac{1}{Z_1}\, e^{-A_3 f_t}, & l_t = 1, \\ \delta(f_t), & l_t = 0, \end{cases} \qquad (6)$$



where $Z_1$ is a partition function and $\delta$ is the Kronecker delta.
If the descriptor $F$ includes object size, the above distribution
suggests that the larger the object passing through $x$ the less
likely it is, and also that with probability 1 it has size zero in
idle intervals (consistent with Fig. 2). This is motivated by the
observation that small-size detections are usually associated
with false positives when computing $L_t$. Should $F$ include
speed, faster objects would be less likely, a realistic assumption
in an urban setting. The model would have to be modified should
the descriptor include direction of motion (e.g., horizontal 
motion more likely for highway surveillance with a suitably- 
oriented camera) or luminance/color (e.g., all photometric 
properties equally likely). 

Note that more advanced descriptor models can be incorporated
as well. For instance, one can enforce temporal
smoothness of the descriptor (e.g., size) for an object passing
through location $x$ via a (temporal) Gibbs distribution with
2-element cliques:

$$P_x(\{F_i = f_i\}_W \mid L = 1) = \ldots$$

where $\{F_i = f_i\}_W$ denotes a sequence of descriptors appearing
in the temporal window $W$, and $A_4$ is a constant.
This model controls temporal smoothness of the descriptor $F$,
and can be used to limit, for example, size variations in time.
Nevertheless, for simplicity we omit this model in our further
developments.

Combining the descriptor model (6) with the $L$-based event
model (4) leads to a joint distribution:

$$\begin{aligned}
P_x(\{L_i = l_i\}_W, \{F_i = f_i\}_W) &= P_x(\{F_i = f_i\}_W \mid \{L_i = l_i\}_W) \cdot P_x(\{L_i = l_i\}_W) \\
&= \prod_i P_x(F_i = f_i \mid L_i = l_i) \cdot P_x(\{L_i = l_i\}_W)
\end{aligned}$$

where the last line stems from the conditional independence
of $F_i$'s when conditioned on $L_i$'s assumed earlier. Taking the
negative logarithm and using equations (5) and (6) results in:



$$-\log P_x(\{L_i = l_i\}_W, \{F_i = f_i\}_W) = A_0' + A_1 \sum_{k=t-w+1}^{t} l_k(x) + A_2\, \kappa_t(x) + A_3 \sum_{k=t-w+1}^{t} f_k(x)\, l_k(x), \qquad (8)$$

where $A_0'$ accounts for $Z_1$ in (6) and the last term is the sum of
descriptors in all busy periods in $W$. Note that the constant
$A_2$ is positive, thus reducing the probability when frequent
"moving $\to$ static" and "static $\to$ moving" transitions take
place. The constant $A_1$ may be negative or positive depending
on the particular values of $q$ and $p$ in the Markov model;
increasing busy periods within $W$ will lead to an increased
($q > p$) or decreased ($q < p$) joint probability.

Note that at each location $x$ the above model implicitly
assumes independence among the busy and idle periods as well
as conditional independence of $F_t$ when conditioned on $L_t =
l_t$. This assumption is reasonable since different busy periods
at a pixel correspond to different objects while different idle
periods correspond to temporal distances between different
objects. Typically, these are all independent.²

With each time $t$ and position $x$ we associate an event $E_t$
that represents the statistic described in (8), namely,

$$E_t(x) = \sum_{k=t-w+1}^{t} \left( A_1 L_k(x) + A_3 F_k(x) L_k(x) \right) + A_2\, \mathcal{K}_t(x), \qquad (9)$$

where the constant $A_0'$ was omitted as it does not contribute
to the characterization of dynamic behavior (identical value
across all $x$ and $t$) and $\mathcal{K}$ is a random variable associated with
realization $\kappa$ (number of transitions). The main implication
of the above event description is that it serves as a sufficient
statistic for determining optimal decision rules [?].

²We have performed extensive experiments ranging from highway traffic to
urban scenarios and the results appear to be consistent with these assumptions.

c) Anomaly Detection Problem: We first describe
anomaly detection abstractly. We are given data, $\omega \in \Omega \subset \mathbb{R}^d$.
The nominal data are sampled from a multivariate density
$g_0(\cdot)$ supported on the compact set $\Omega$. Anomaly detection [?]
can be formulated as a composite hypothesis testing problem.
Suppose the test data, $\omega$, come from a mixture distribution,
namely, $g(\cdot) = (1 - \xi)\, g_0(\cdot) + \xi\, g_1(\cdot)$, where $g_1(\cdot)$ is also
supported on $\Omega$. Anomaly detection involves testing the following
nominal hypothesis

$$H_0 : \xi = 0 \quad \text{versus the alternative (anomaly)} \quad H_1 : \xi > 0.$$

The goal is to maximize the detection power subject to false
alarm level $\alpha$, namely, $\text{Prob}(\text{declare } H_1 \mid H_0) \leq \alpha$. Since
the mixing density is unknown, it is usually assumed to be
uniform. In this case the optimal uniformly most powerful test
(over all values of $\xi$) amounts to thresholding the nominal
density [?]. We choose a threshold $\tau(\alpha)$ and declare the
observation, $\omega$, as an outlier according to the following
log-likelihood test:

$$-\log(g_0(\omega)) \underset{H_0}{\overset{H_1}{\gtrless}} \tau(\alpha), \qquad (10)$$

where $\tau(\alpha)$ is chosen to ensure that the false alarm probability
is smaller than $\alpha$. It follows that such a choice is the uniformly
most powerful decision rule. Now the main problem that arises
is that $g_0(\cdot)$ is unknown and has to be learned in some way
from the data. The issue is that $\omega$ could be high-dimensional
and learning such distributions may not be feasible. This is
further compounded in video processing by the fact that it is
even unclear what $\omega$, i.e., the features, should be.

It is worth reflecting how we have addressed these issues
through our specific setup. We are given $w$ video frames,
$I_{t-w+1}, I_{t-w+2}, \ldots, I_t$, and a specific location $x$, and our
task is to determine whether this sequence is consistent with
nominal activity or, alternatively, it is anomalous. We also
have training data that describes the nominal activity. In this
context, our Markovian model provides a representation for
the observed video frames. This representation admits a natural
factorization, wherein increasingly complex features can
be incorporated, for example through Markov-Gibbs models.
Furthermore, the log-likelihood is shown to be reduced to a
scalar sufficient statistic, which is parameterized by a finite set
of parameters ($A_j$'s in (9)). Consequently, the issue of learning
a high-dimensional distribution is circumvented and one is left
with estimating the finite number of parameters, which can
be done efficiently using standard regression techniques. The
problem of anomaly detection now reduces to thresholding the
event $E_t = e_t$ according to (10):

$$e_t(x) \underset{H_0}{\overset{H_1}{\gtrless}} \tau(\alpha),$$

or, explicitly,

$$\sum_{k=t-w+1}^{t} \left( A_1 l_k(x) + A_3 f_k(x)\, l_k(x) \right) + A_2\, \kappa_t(x) \underset{H_0}{\overset{H_1}{\gtrless}} \tau(\alpha). \qquad (11)$$
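
To make the test in (11) concrete, the following Python sketch computes the scalar event statistic for one pixel from a window of motion labels and size descriptors and compares it with a threshold. It is a minimal illustration, not the authors' implementation: the transition term is taken to be the raw count of label changes in the window (the text only says it is proportional to that count), and the default constants `a1`, `a2`, `a3` are placeholders rather than fitted values.

```python
def event_statistic(labels, sizes, a1, a2, a3):
    """Left-hand side of Eq. (11) for one pixel over one w-frame window.

    labels: binary motion labels l_k(x); sizes: descriptors f_k(x).
    """
    busy_term = sum(a1 * l + a3 * f * l for l, f in zip(labels, sizes))
    # Assumption: kappa_t(x) is the number of label changes in the window.
    transitions = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return busy_term + a2 * transitions


def is_anomalous(labels, sizes, tau, a1=1.0, a2=0.0, a3=1.0):
    """Declare H1 (anomaly) when the event statistic exceeds tau."""
    return event_statistic(labels, sizes, a1, a2, a3) > tau
```

With `a2 = 0` and `a1 = a3 = 1` the statistic reduces to the accumulated busy time plus the accumulated object size, so a large object lingering at a normally quiet pixel yields a high value.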







Fig. 3. Event model PDF estimated for four different pixels. The two pixels 
in traffic lanes have similar histograms due to the fact that their behaviors are 
very similar (continuous highway traffic). The pixel above the traffic is in the
idle area of the video, so its histogram has a high peak near zero. The pixel
on the overpass has a bimodal distribution caused by the traffic light.

Our task is to find an appropriate threshold $\tau(\alpha)$ so that the
false alarms are bounded by $\alpha$. Note that our events are now
scalar and learning the density function of a 1-D random
variable can be done efficiently. The main requirement is that
$E_t(x)$ be a stationary ergodic stochastic process, which will
ensure that the CDF can be accurately estimated:

$$\frac{1}{w} \sum_{k=t-w+1}^{t} \mathbb{1}_{\{E_k(x) > \eta\}}(e_k(x)) \longrightarrow \text{Prob}_x\{E > \eta\},$$

where $\mathbb{1}_{\{E_k(x) > \eta\}}(e_k(x))$ is an indicator function, equal to 1
when $e_k(x) > \eta$ and 0 otherwise, while $\text{Prob}_x$ denotes the
representative stationary distribution for $E_t$ at any time $t$. For
Markovian processes this type of ergodicity is standard [?].
One extreme situation is to choose a threshold that ensures
zero false alarms. This corresponds to choosing $\tau(0) =
\max_t e_t$, i.e., the maximum value of the support of all events
in the training data.
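
The threshold selection just described can be sketched as an empirical tail estimate over the training events of a single pixel. This is only one simple way to pick $\tau(\alpha)$, assuming the training events are available as a plain list; the helper names are hypothetical.

```python
def empirical_tail(events, eta):
    """Fraction of training events exceeding eta (estimate of Prob{E > eta})."""
    return sum(1 for e in events if e > eta) / float(len(events))


def threshold_for(events, alpha):
    """Smallest observed event value whose empirical tail is at most alpha.

    For alpha = 0 this returns max(events), the zero-false-alarm threshold.
    """
    for tau in sorted(events):
        if empirical_tail(events, tau) <= alpha:
            return tau
```

Searching only over observed values keeps the sketch short; the largest training event always satisfies the condition, so the loop is guaranteed to return for any $\alpha \geq 0$.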

Although the anomaly detection algorithm we describe 
in the next section requires no explicit estimation of the 
above CDF, it is nevertheless instructive to understand its 
properties. Fig. 3 shows example PDFs for our test statistic
$e_t(x)$ estimated from training data using smoothed histograms.
Note the different histogram shapes depending on the nature of
local activity. 

V. Behavior Subtraction Framework 

In the previous section, we presented object and event mod- 
els, and explained how they fit into the problem of anomaly 
detection. In principle, once the event model is known, various
statistical techniques can be applied but this would require 
significant memory commitment and computational resources. 
Below, we propose an alternative that is memory-light and 
processor-fast and yet produces very convincing results. 

A. Behavior Images 

As mentioned in the previous section, one extreme situation
in anomaly detection is to ensure zero false alarms. This
requires a suitable threshold, namely $\tau(0) = \max_t e_t$, equal to
the maximum value of the support of all events in the training
data. This threshold is space-variant and can be captured by a
2-D array:

$$B(x) = \max_{t \in [1, M]} e_t(x), \qquad (12)$$

where $M$ is the length of the training sequence. We call $B$ the
background behavior image [?] as it captures the background
activity (in the training data) in a low-dimension representation
(one scalar per location $x$). This specific $B$ image captures
peak activity in the training sequence, and can be efficiently
computed as it requires no estimation of the event PDF;
maximum activity is employed as a surrogate for normality.

Fig. 4. Behavior subtraction results for the maximum-activity surrogate (12)
on data captured by a stationary, although vibrating, camera. This is a highly-
cluttered intersection of two streets and an interstate highway. Although the jitter
induces false positives during background subtraction ($L_t$), only the tramway
is detected by behavior subtraction; the rest of the scene is considered normal.

As shown in Fig. 4, the B image succinctly synthesizes the
ongoing activity in a training sequence, here a busy urban 
intersection at peak hour. It implicitly includes the paths 
followed by moving objects as well as the amount of activity 
registered at every point in the training sequence. 

The event model © is based on binary random variables 
$L$ whose realizations $l$ are computed, for example, using
background subtraction. Since the computed labels $l$ will
necessarily be noisy, i.e., will include false positives and misses,
a positive bias will be introduced into the event model (even
if the noise process is i.i.d., its mean is positive since labels $l$
are either 0 or 1). The simplest method of noise suppression is
by means of lowpass filtering. Thus, in scenarios with severe 
event noise (e.g., unstable camera, unreliable background 
subtraction) instead of seeking zero false-alarm rate we opt 
for event-noise suppression using a simple averaging filter to 
compute the background behavior image [?]: 

$$B(x) = \frac{1}{M} \sum_{t=1}^{M} e_t(x). \tag{13}$$

This background behavior image estimates a space-variant bias
from the training data. A non-zero bias can be considered as 
a temporal stationarity, and therefore normality, against which 
observed data can be compared. 
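
A minimal sketch of the average-activity surrogate follows. It assumes the events arrive one at a time as NumPy arrays and uses a running mean so that the training sequence never has to be stored; the function name `average_behavior_image` is hypothetical.

```python
import numpy as np


def average_behavior_image(event_stream):
    """Per-pixel mean of the events over the training sequence.

    event_stream: iterable of arrays e_t, all of the same shape.
    Uses the incremental update mean += (e_t - mean) / t, so only one
    image is kept in memory. Returns None for an empty stream.
    """
    mean = None
    for count, e_t in enumerate(event_stream, start=1):
        e_t = np.asarray(e_t, dtype=np.float64)
        if mean is None:
            mean = e_t.copy()
        else:
            mean += (e_t - mean) / count
    return mean


if __name__ == "__main__":
    # Three toy events on a 1 x 2 image.
    events = [np.array([[0.0, 1.0]]),
              np.array([[1.0, 1.0]]),
              np.array([[2.0, 1.0]])]
    print(average_behavior_image(events))
```

The incremental form is one design choice among several; a plain `np.mean` over a stacked array gives the same result when memory is not a concern.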

B. Behavior Subtraction

Having defined the zero-false-alarm threshold $\tau(0)$ or
event-noise bias via the background behavior image $B$ in (12) and (13), we
can now apply the event hypothesis test (11) as follows:

$$e_t(x) - B(x) \;\underset{\text{normal}}{\overset{\text{abnormal}}{\gtrless}}\; \Theta$$

where $\Theta$ is a user-selectable constant allowing for non-zero
tolerance ($\Theta = 0$ leads to a strict test). In analogy to calling $B$
a background behavior image, we call $e_t$ an observed behavior
image as it captures events observed in the field of view of
the camera over a window of $w$ video frames. The above test
requires the accumulation of motion labels $l$, object sizes,
and state transitions over $w$ frames. All these quantities
can be easily and efficiently computed. 
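
To illustrate why this accumulation is cheap, the sketch below keeps a per-pixel count of active motion labels over the last $w$ frames using a ring buffer, so each new frame costs one subtraction and one addition per pixel. It covers only the motion labels (object sizes and state transitions are omitted), and the class name `SlidingLabelCounter` is hypothetical.

```python
import numpy as np


class SlidingLabelCounter:
    """Per-pixel count of active motion labels over the last w frames."""

    def __init__(self, shape, w):
        self.w = w
        # Ring buffer holding the last w binary label fields.
        self.buffer = np.zeros((w,) + tuple(shape), dtype=bool)
        self.count = np.zeros(shape, dtype=np.int32)
        self.pos = 0

    def update(self, labels):
        """Add a new label field and return the updated per-pixel counts."""
        labels = np.asarray(labels, dtype=bool)
        # Remove the frame that leaves the window, then add the new one.
        self.count -= self.buffer[self.pos].astype(np.int32)
        self.count += labels.astype(np.int32)
        self.buffer[self.pos] = labels
        self.pos = (self.pos + 1) % self.w
        return self.count.copy()


if __name__ == "__main__":
    counter = SlidingLabelCounter(shape=(1, 2), w=2)
    for frame in ([[1, 0]], [[1, 1]], [[0, 1]]):
        print(counter.update(frame))
```

This version stores one byte per label for clarity; packing the labels into bits (for example with `np.packbits`) would reduce the buffer to roughly $w/8$ bytes per pixel.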

Clearly, abnormal behavior detection in this case simplifies 
to the subtraction of the background behavior image B, 
containing an aggregate of long-term activity in the training 
sequence, from the observed behavior image $e_t$, containing
a snapshot of activity just prior to time t, and subsequent 
thresholding. This explains the name behavior subtraction that 
we gave to this method. 
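
The subtraction-and-threshold step described above can be sketched as follows. The function name `behavior_subtraction` and the toy values are illustrative assumptions; the inputs are taken to be an observed behavior image and a background behavior image of the same shape.

```python
import numpy as np


def behavior_subtraction(e_t, B, theta=0.0):
    """Boolean anomaly map: True (abnormal) where e_t(x) - B(x) > theta."""
    e_t = np.asarray(e_t, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    if e_t.shape != B.shape:
        raise ValueError("observed and background behavior images must match in shape")
    return (e_t - B) > theta


if __name__ == "__main__":
    B = np.array([[0.1, 0.1], [0.5, 0.6]])    # background behavior image
    e_t = np.array([[0.2, 0.9], [0.5, 0.5]])  # observed behavior image
    print(behavior_subtraction(e_t, B, theta=0.5))
```

With `theta=0.0` the test is strict: any location whose observed activity exceeds the background behavior image is flagged.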

VI. Experimental Results 

We tested our behavior subtraction algorithm for both the 
maximum- and average-activity surrogates on black-and-white
and color, indoor and outdoor, urban and natural-environment 
video sequences. In all cases, we computed the label fields 
$l_t$ using simple background subtraction (1) with $\tau = 40$
and background h updated with a between 10~^ and 10~^, 
depending on the sequence. Although we have performed 
experiments on a wide range of model parameters, we
present here the results for the event model based on the size
descriptor © (Ai = A2 = 0). 

The results of behavior subtraction using the maximum-activity
surrogate (12) are shown in Figs. 4-??. Each result
was obtained using a training sequence of length $M$ = 1000-5000
frames, $w = 100$, and $\Theta \in [0.5, 0.7]$. As is clear from
the figures, the proposed method is robust to inaccuracies in
motion labels $l_t$. Even if moving objects are not precisely
detected, the resulting anomaly map is surprisingly precise.
This is especially striking in Fig. 4, where a highly-cluttered
environment results in a high density of motion labels while
camera jitter corrupts many of those labels.

Behavior subtraction is also effective in the removal of
unstructured, parasitic motion such as that due to water activity
(fountain, rain, shimmering surface), as illustrated in Fig. [Jl 
Note that although motion label fields $l_t$ include unstructured
detections due to water droplets, only the excessive motion is 
captured by the anomaly maps (passenger car and truck with 
trailer). Similarly, the shimmering water surface is removed 
by behavior subtraction producing a fairly clean boat outline 
in this difficult scenario. Our method also manages to detect 
abandoned objects and people lingering, as seen in the two 
bottom rows of Fig. 5.

Fig. ?? shows yet another interesting outcome of behavior
subtraction. In this case the background behavior image
was trained on a video with a single pedestrian and fluttering
leaves. While the object-size descriptor captures both individual
pedestrians and groups thereof, anomalies are detected
only when a large group of pedestrians passes in front of the
camera.

The results of behavior subtraction using the average-activity
surrogate are shown in Fig. ??. The video sequence
was captured by a vibrating camera (structural vibrations
of the camera mount). It is clear that behavior subtraction with
the average-activity surrogate outperforms background subtraction
based on the single-Gaussian model [?] and the non-parametric-kernel
model [?]. As can be seen, behavior subtraction effectively
eliminates false positives without significantly increasing
misses.

As already mentioned, the proposed method is efficient in
terms of processing power and memory use, and thus can
be implemented on modest-power processors (e.g., embedded
architectures). For each pixel, it requires one floating-point
number for each of $B$ and $e$, and $w/8$ bytes
for the binary motion labels. This corresponds to a total of 11 bytes per pixel
for $w = 24$. This is significantly less than the 12 floating-point
numbers per pixel needed by a tri-variate Gaussian
for color video data (3 floating-point numbers for the R, G, B
means and 9 numbers for the covariance matrix). Our method
currently runs in MATLAB at 20 fps on 352 × 240-pixel video
using a 2.1 GHz dual-core Intel processor. More experimental
results can be found in our preliminary work [?], [?],
while complete video sequences can be downloaded from
www . dmi . us herb . ca/^ jodoin/pro jects/PAMI^OO 9. 
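
The per-pixel memory figures quoted above can be checked with a short calculation. The sketch assumes 4-byte single-precision floats, which is what the 11-byte total implies (2 × 4 bytes for $B$ and $e$ plus 24/8 = 3 bytes of packed labels); the function names are hypothetical.

```python
def behavior_bytes_per_pixel(w, float_bytes=4):
    """Two floats (B and e) plus w one-bit labels packed into whole bytes."""
    label_bytes = -(-w // 8)  # ceiling of w / 8
    return 2 * float_bytes + label_bytes


def gaussian_bytes_per_pixel(float_bytes=4):
    """Tri-variate Gaussian: 3 means plus a 3 x 3 covariance matrix."""
    return (3 + 9) * float_bytes


if __name__ == "__main__":
    pixels = 352 * 240  # frame size used in the timing experiment
    for name, per_pixel in (("behavior subtraction", behavior_bytes_per_pixel(24)),
                            ("tri-variate Gaussian", gaussian_bytes_per_pixel())):
        print(f"{name}: {per_pixel} bytes/pixel, {pixels * per_pixel} bytes/frame")
```

Under these assumptions the model state for a 352 × 240 frame is 929,280 bytes for behavior subtraction versus 4,055,040 bytes for the Gaussian model.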

VII. Conclusions 

In this paper, we proposed a framework for the characterization
of dynamic events and, more generally, behavior. We
defined events as spatio-temporal signatures composed of various
moving-object features, and modeled them using stationary
random processes. We also proposed a computationally-efficient
implementation of the proposed models, called behavior
subtraction. In fact, due to the simple surrogates of
activity/behavior statistics used, behavior subtraction is very easy
to implement, uses little memory and can run on an embedded
architecture. Furthermore, the proposed framework is content-blind,
i.e., equally applicable to pedestrians, motor vehicles
or animals. Among applications that can benefit from the
proposed framework are suspicious behavior detection and
motion detection in the presence of strong parasitic background
motion. Yet, challenges remain. One challenge is to extend
the proposed concepts to multiple cameras so that a mutual
reinforcement of decisions takes place; some of our preliminary
work can be found in [?]. Another challenge is to detect
anomalies at the object level while using only the pixel-level decisions
proposed here.

Fig. 5. Behavior subtraction results for the maximum-activity surrogate (12) on video sequences containing a shimmering water surface (two top rows), strong
shadows (third row) and a very small abnormally-behaving object (bottom row).



